Did You Mean...? Confidence-based Trade-offs in Semantic Parsing
We illustrate how a calibrated model can help balance common trade-offs in
task-oriented parsing. In a simulated annotator-in-the-loop experiment, we show
that well-calibrated confidence scores allow us to balance cost with annotator
load, improving accuracy with a small number of interactions. We then examine
how confidence scores can help optimize the trade-off between usability and
safety. We show that confidence-based thresholding can substantially reduce the
number of incorrect low-confidence programs executed; however, this comes at a
cost to usability. We propose DidYouMean, a system that better balances
usability and safety.
Comment: 9 pages. arXiv admin note: substantial text overlap with arXiv:2211.0744
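To make the trade-off concrete, here is a minimal sketch of confidence-based thresholding for executable parses; the parser interface and the threshold value are illustrative assumptions, not the DidYouMean system itself.

THRESHOLD = 0.7  # hypothetical operating point on the usability/safety curve

def maybe_execute(program, confidence, threshold=THRESHOLD):
    """Execute only high-confidence programs; defer the rest to the user.

    Raising the threshold blocks more incorrect programs (safety) but
    also rejects more correct ones (usability), the trade-off described
    in the abstract above.
    """
    if confidence >= threshold:
        return ("execute", program)
    return ("defer_to_user", None)

# Usage: suppose a calibrated parser returns a program and a score.
program, score = "book_flight(dest='JFK')", 0.42
action, payload = maybe_execute(program, score)
print(action)  # defer_to_user -- too risky to run unconfirmed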
Calibrated Interpretation: Confidence Estimation in Semantic Parsing
Sequence generation models are increasingly being used to translate natural
language into programs, i.e. to perform executable semantic parsing. The fact
that semantic parsing aims to predict programs that can lead to executed
actions in the real world motivates developing safe systems. This in turn makes
measuring calibration -- a central component to safety -- particularly
important. We investigate the calibration of common generation models across
four popular semantic parsing datasets, finding that it varies across models
and datasets. We then analyze factors associated with calibration error and
release new confidence-based challenge splits of two parsing datasets. To
facilitate the inclusion of calibration in semantic parsing evaluations, we
release a library for computing calibration metrics.
Comment: TACL Camera-ready
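The abstract does not name its metrics, but a standard one such a library would cover is expected calibration error (ECE); below is a minimal sketch of the usual binned formulation, not necessarily the released library's API.

import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: per-bin |accuracy - mean confidence|, weighted by
    bin size. A perfectly calibrated model scores 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap
    return ece

# Overconfident predictions (high scores, low accuracy) give a large ECE.
print(expected_calibration_error([0.9, 0.95, 0.99, 0.9], [1, 0, 0, 1]))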
Rephrase, Augment, Reason: Visual Grounding of Questions for Vision-Language Models
An increasing number of vision-language tasks can be handled with little to
no training, i.e., in a zero and few-shot manner, by marrying large language
models (LLMs) to vision encoders, resulting in large vision-language models
(LVLMs). While this has huge upsides, such as not requiring training data or
custom architectures, how an input is presented to an LVLM can have a major
impact on zero-shot model performance. In particular, inputs phrased in an
underspecified way can result in incorrect answers due to factors like missing
visual information, complex implicit reasoning, or linguistic ambiguity.
Therefore, adding visually grounded information to the input as a preemptive
clarification should improve model performance by reducing underspecification,
e.g., by localizing objects and disambiguating references. Similarly, in the
VQA setting, changing the way questions are framed can make them easier for
models to answer. To this end, we present Rephrase, Augment and Reason
(RepARe), a gradient-free framework that extracts salient details about the
image using the underlying LVLM as a captioner and reasoner, in order to
propose modifications to the original question. We then use the LVLM's
confidence over a generated answer as an unsupervised scoring function to
select the rephrased question most likely to improve zero-shot performance.
Focusing on two visual question answering tasks, we show that RepARe can result
in a 3.85% (absolute) increase in zero-shot performance on VQAv2 and a 6.41
percentage-point increase on A-OKVQA. Additionally, we find that using gold
answers for oracle question candidate selection yields a substantial gain in
VQA accuracy of up to 14.41%. Through extensive analysis, we demonstrate that outputs from
RepARe increase syntactic complexity, and effectively utilize vision-language
interaction and the frozen language model in LVLMs.
Comment: 22 pages, 4 figures, Code: https://github.com/archiki/RepAR
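A rough sketch of the unsupervised selection step described above: answer each candidate rephrasing and keep the one the model answers most confidently. The toy model, its interface, and the mean-token-log-probability confidence proxy are assumptions; see the linked repository for the actual implementation.

class ToyLVLM:
    """Stand-in for a real LVLM: returns a canned answer plus fake
    per-token log-probabilities so the selection logic is runnable."""
    def answer(self, image, question):
        fake = {"What color is it?": [-1.5, -2.0],
                "What color is the bus on the left?": [-0.1, -0.2]}
        return "blue", fake.get(question, [-3.0])

def select_question(lvlm, image, candidates):
    """Keep the candidate question the model answers most confidently,
    scoring each by the mean token log-probability of its answer."""
    def confidence(question):
        _, logprobs = lvlm.answer(image, question)
        return sum(logprobs) / len(logprobs)
    return max(candidates, key=confidence)

candidates = ["What color is it?", "What color is the bus on the left?"]
print(select_question(ToyLVLM(), image=None, candidates=candidates))
# -> the visually grounded rephrasing wins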
Why Did the Chicken Cross the Road? Rephrasing and Analyzing Ambiguous Questions in VQA
Natural language is ambiguous. Resolving ambiguous questions is key to
successfully answering them. Focusing on questions about images, we create a
dataset of ambiguous examples. We annotate these, grouping answers by the
underlying question they address and rephrasing the question for each group to
reduce ambiguity. Our analysis reveals a linguistically-aligned ontology of
reasons for ambiguity in visual questions. We then develop an English
question-generation model which, as we demonstrate via automatic and human
evaluation, produces less ambiguous questions. We further show that the question
generation objective we use allows the model to integrate answer group
information without any direct supervision.
Comment: ACL 2023. Code and data: https://github.com/esteng/ambiguous_vq
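As a hypothetical rendering of the annotation scheme just described -- answers grouped by the underlying question they address, with a disambiguated rephrasing per group -- one example might look like the following; all field names and values are illustrative, and the linked repository defines the real format.

example = {
    "image_id": "COCO_000000000001",
    "original_question": "What is the man holding?",
    "answer_groups": [
        {"rephrased_question": "What object is in the man's left hand?",
         "answers": ["umbrella", "an umbrella"]},
        {"rephrased_question": "What object is in the man's right hand?",
         "answers": ["phone", "cell phone"]},
    ],
}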
The Curious Case of Control
Children acquiring English make systematic errors on subject control
sentences even after they have reached near-adult competence (C. Chomsky,
1969), possibly due to heuristics based on semantic roles (Maratsos, 1974).
Given the advanced fluency of large generative language models, we ask whether
model outputs are consistent with these heuristics, and to what degree
different models are consistent with each other. We find that models can be
categorized by behavior into three separate groups, with broad differences
between the groups. The outputs of models in the largest group are consistent
with positional heuristics that succeed on subject control but fail on object
control. This result is surprising, given that object control is orders of
magnitude more frequent in the text data used to train such models. We examine
to what degree the models are sensitive to prompting with agent-patient
information, finding that raising the salience of agent and patient relations
results in significant changes in the outputs of most models. Based on this
observation, we leverage an existing dataset of semantic proto-role annotations
(White et al., 2020) to explore the connections between control and labeling
event participants with properties typically associated with agents and
patients.
Comment: 11 pages
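One minimal way to probe the behavior the abstract describes, sketched below: compare the probability a causal LM assigns to the subject versus the object referent after a subject-control sentence. The prompt wording and the choice of GPT-2 are assumptions, not the paper's exact protocol.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def continuation_logprob(prompt, continuation):
    """Sum of log-probabilities of `continuation` tokens given `prompt`."""
    n_prompt = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + continuation, return_tensors="pt").input_ids
    with torch.no_grad():
        logprobs = model(ids).logits.log_softmax(-1)
    # The token at position i is predicted by the logits at position i - 1.
    return sum(logprobs[0, i - 1, ids[0, i]].item()
               for i in range(n_prompt, ids.shape[1]))

# "promise" is subject control: the boy is the one who leaves.
prompt = "The boy promised the girl to leave. The one who leaves is"
for referent in [" the boy", " the girl"]:
    print(referent, continuation_logprob(prompt, referent))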